Abstract
This project applies simple linear regression (SLR) to real data using R. I will be using a dataset that contains information about employees in a company, including their years of experience and current salary. The data come from an online sample dataset of employees versus salary. Using the textbook and the material from all the labs covered so far, I will process and analyze the data to see how salary changes with an employee's years of experience, and to check whether years of experience even matters for a raise in salary. The goal is to apply SLR to examine the relationship between the two variables.
Prithviraj Kadiyala
The following was taken from a Forbes article:
There has been a lot of buzz in the software industry about unequal pay between employees who have worked at a single company for a long time and recent graduates who start at very good salaries. Many articles have also been written about how loyal employees are underpaid while employees who change jobs every 3-4 years receive raises of almost 50%.
Those very new to the tech industry, with less than a year of experience, can expect to earn $50,321 (a year-over-year increase of 9.8 percent). After a year or two, that average salary jumps to $62,517 (a whopping 24.3 percent increase, year-over-year).
Spend three to five years, and the average leaps yet again, to $68,040 (a 6.3 percent increase). Between six and ten years in the industry, salaries hit $83,143 (a rise of 6.8 percent).
Breaking the ten-year mark translates into big bucks. Those with 11 to 15 years of experience could expect to pull down $96,792 (a 3.8 percent increase over last year), while those with more than 15 years average $115,399 (a 6 percent increase).
Below is the graph that shows the salary increase when employees jump companies:
The data is read in below:
dataset = read.csv("Emp_Salary.csv",header=TRUE,sep=",")
head(dataset)
names(dataset)
## [1] "Employee" "EducLev" "JobGrade" "YrsExper" "Age" "Gender"
## [7] "YrsPrior" "PCJob" "Salary"
library(s20x)
## Warning: package 's20x' was built under R version 3.4.4
pairs20x(dataset)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
g = ggplot(dataset, aes(x = YrsExper, y = Salary, color = EducLev)) + geom_point()
g = g + xlab("Years of Experience")
g = g + geom_smooth(method = "loess")
g
The data is a sample dataset taken from the Kaggle website.
As a computer engineering graduate student, I was looking into how I could earn the most money after graduating. I also wanted to see how many years of experience it takes to make $100K while working at a single company, and, if I can't reach $100K even after working many years at a single company, what it would take to make that kind of money.
This data was gathered to give job-seekers an idea of how much salary to expect after gaining some years of experience, even though salary may depend on various other factors (\(salary = base + var * experience * performance\)) as mentioned in the website cited here.
With the prospect of working in the software industry in the future, analyzing how the IT industry works beforehand, and being prepared for what to do and when to do it, can put me in a really good position for entering the market and negotiating a higher base salary package.
I would like to look into the dataset and determine whether salary depends directly on years of experience, and if not, whether we see a lot of variation in the graph as the years of experience increase. Some people claim that only technical knowledge will take us to a position that pays well. That is true to some extent, but we will study how having experience affects one's salary. So the problem we define today is: “Is years of experience alone enough to increase your salary, or does having years and years of experience have no effect on your salary at all?”
I believe that the value of Salary (Y) tends to increase with Years of Experience (X); that is, as an employee gains years of experience, his salary also tends to increase. I want to build a model relating the two variables by drawing a line through the points. I will define salary as my dependent variable and years of experience as my independent variable.
I could use a deterministic model if all the data points were perfectly aligned and I didn’t have to worry about errors in my prediction; however, I know from the preliminary graphs that the data points are not perfectly aligned. A probabilistic model will be more accurate, in this instance, as it will take into account the randomness of the distribution of data points around the line. A simple linear regression model (hereafter referred to as SLR) is one type of probabilistic model, and will be used in my data analysis. SLR assumes that the mean value of the y data for any value of the x data will make a straight line when graphed, and that the deviations of individual points from the line (above or below) are random errors with mean zero. This statement is written as:
\[ y_i= \beta_0 +\beta_1x_i+\epsilon_i \]
where \(\beta_0\) and \(\beta_1\) are unknown parameters, \(\beta_0+\beta_1x_i\) is the mean value of y for a given x, and \(\epsilon_i\) is the random error. Working with the assumption that some points are going to deviate from the line, I know that some will be above (positive deviation) and some below (negative deviation), with \(E(\epsilon_i)=0\). That makes the mean value of y: \[ \begin{align} E(y)&=E(\beta_0+\beta_1x_i+\epsilon_i)\\ &=\beta_0+\beta_1x_i+E(\epsilon_i)\\ &=\beta_0+\beta_1x_i \end{align} \]
Thus, the mean value of y for any given value of x is represented by \(E(Y|x)\) and graphs as a straight line, with an intercept of \(\beta_0\) and a slope of \(\beta_1\).
The regression has five key assumptions:
- Linear Relationship
- Multivariate Normality
- No or little multicollinearity
- No autocorrelation
- Homoscedasticity
In order to estimate \(\beta_0\) and \(\beta_1\) we are going to use the method of least squares. As discussed in class, this helps us determine the line that best fits our data points by minimizing the sum of squares of the deviations, called the SSE or Sum of Squares for Errors. In the straight-line model we have already discussed that \(y_i= \beta_0 +\beta_1x_i+\epsilon_i\). The estimator will be \(\hat y= \hat\beta_0 +\hat \beta_1x_i\). The residual (the deviation of the ith value of y from its predicted value) is calculated by \((y_i-\hat y_i) = y_i-(\hat\beta_0+\hat\beta_1x_i)\). Thus \(SSE=\sum^n_{i=1}[y_i-(\hat\beta_0+\hat\beta_1x_i)]^2\)
If the model works well with our data then we should expect that the residuals are approximately normal in distribution with mean = 0 and a constant variance.
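Minimizing the SSE with respect to \(\hat\beta_0\) and \(\hat\beta_1\) yields the usual closed-form least-squares estimators (the standard textbook result, stated here for reference):

\[ \hat\beta_1 = \frac{SS_{xy}}{SS_{xx}} = \frac{\sum^n_{i=1}(x_i-\bar x)(y_i-\bar y)}{\sum^n_{i=1}(x_i-\bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1\bar x \]

These are exactly the quantities that R's `lm()` computes when fitting the model.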
dataset.lm=lm(Salary~YrsExper,data=dataset)
summary(dataset.lm)
##
## Call:
## lm(formula = Salary ~ YrsExper, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33054 -5782 -967 5792 30971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30329.73 1054.54 28.76 <2e-16 ***
## YrsExper 991.64 88.44 11.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8892 on 206 degrees of freedom
## Multiple R-squared: 0.379, Adjusted R-squared: 0.376
## F-statistic: 125.7 on 1 and 206 DF, p-value: < 2.2e-16
We can get the values for the intercept (\(\hat\beta_0\)) and the slope estimate (\(\hat\beta_1\)) from the summary above. \[ \begin{align} \hat\beta_0 &= 30329.73\\ \hat\beta_1 &= 991.64 \end{align} \]
ciReg(dataset.lm, conf.level=0.95, print.out=TRUE)
## 95 % C.I.lower 95 % C.I.upper
## (Intercept) 28250.6592 32408.80
## YrsExper 817.2669 1166.01
\[ \begin{align} \hat\beta_0 + \hat\beta_1x_i &= 30329.73 + 991.64* x_i \end{align} \]
The least-squares estimate of the slope, \(\hat\beta_1=991.64\), indicates that the estimated salary increases by about $992 for each additional year of experience, with this interpretation being valid over the observed range of years of experience. We can see that the increase in salary depends on Years of Experience. But let us also fit a quadratic model, to see whether a curve fits the data better and clarifies the result we just got.
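As a quick sanity check on the fitted line, plugging \(x_i = 10\) years into the estimated equation gives:

\[ \hat y = 30329.73 + 991.64 \times 10 \approx 40{,}246 \]

which agrees with the value `predict()` returns later in the analysis.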
plot(Salary~YrsExper,bg="Blue",pch=21,cex=1.2,
ylim=c(0,1.1*max(Salary)),xlim=c(0,1.1*max(YrsExper)),
main="Residual Line Segments of Salary vs YrsExper", data=dataset)
abline(dataset.lm)
plot(Salary~YrsExper,bg="Blue",pch=21,cex=1.2,
ylim=c(0,1.1*max(Salary)),xlim=c(0,1.1*max(YrsExper)),
main="Residual Line Segments of Salary vs YrsExper", data=dataset)
ht.lm=with(dataset, lm(Salary~YrsExper))
abline(ht.lm)
yhat=with(dataset,predict(ht.lm,data.frame(YrsExper)))
with(dataset,{segments(YrsExper,Salary,YrsExper,yhat)})
abline(ht.lm)
plot(Salary~YrsExper,bg="Blue",pch=21,cex=1.2,
ylim=c(0,1.1*max(Salary)),xlim=c(0,1.1*max(YrsExper)),
main="Mean of Salary vs YrsExper", data=dataset)
abline(dataset.lm)
with(dataset, abline(h=mean(Salary)))
with(dataset, segments(YrsExper,mean(Salary),YrsExper,yhat,col="Red"))
plot(Salary~YrsExper,bg="Blue",pch=21,cex=1.2,
ylim=c(0,1.1*max(Salary)),xlim=c(0,1.1*max(YrsExper)),
main="Total Deviation Line Segments of Salary vs YrsExper", data=dataset)
with(dataset,abline(h=mean(Salary)))
with(dataset, segments(YrsExper,Salary,YrsExper,mean(Salary),col="Green"))
RSS=with(dataset,sum((Salary-yhat)^2))
RSS
## [1] 16287668202
MSS=with(dataset,sum((yhat-mean(Salary))^2))
MSS
## [1] 9939439028
TSS=with(dataset,sum((Salary-mean(Salary))^2))
TSS
## [1] 26227107231
\(R^2\) is equal to \(\frac{MSS}{TSS}\), i.e., the proportion of the total variation in Salary that is explained by the trend line. The closer \(R^2\) is to 1, the better the fit of the trend line.
MSS/TSS
## [1] 0.3789758
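Using the sums computed above, and noting the decomposition \(TSS = MSS + RSS\) as a consistency check (\(9{,}939{,}439{,}028 + 16{,}287{,}668{,}202 \approx 26{,}227{,}107{,}231\)):

\[ R^2 = \frac{MSS}{TSS} = \frac{9{,}939{,}439{,}028}{26{,}227{,}107{,}231} \approx 0.379 \]

This matches the Multiple R-squared reported in the `summary()` output for the linear model.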
This value indicates that the trend line explains only about 38% of the variation in salary, so it is not an especially good fit for the data I have.
trendscatter(Salary~YrsExper,f=0.5,data=dataset)
Here, we use trendscatter to get a feel for how the data are scattered. Based on the scatter, we will carry out specific analyses in order to get the best results for our research interest.
We can see that the data are concentrated near the line, with fewer and fewer data points as we move away from the trend line. From this we can say that the errors appear roughly normally distributed. But this is only a visual inspection, and we will have to perform further analysis to get more accurate results. The first thing we will do is fit a linear model, see how it looks, and go forward from there.
Yrs.res=residuals(dataset.lm)
Yrs.fit=fitted(dataset.lm)
plot(dataset$YrsExper,Yrs.res, xlab="YrsExper",ylab="Residuals",ylim=c(-1.5*max(Yrs.res),1.5*max(Yrs.res)),xlim=c(0,1.1*max(dataset$YrsExper)), main="Residuals vs Years of Experience")
The residuals are scattered about zero on the y-axis, though somewhat more of them lie below zero; overall this indicates no significant systematic deviation from the line of best fit.
trendscatter(Yrs.fit,Yrs.res, xlab="Fitted", ylab="Residuals")
normcheck(dataset.lm,shapiro.wilk = TRUE)
The p-value for the Shapiro-Wilk test is essentially 0. The null hypothesis in this case is that the errors are normally distributed:
\[\epsilon_i \sim N(0,\sigma^2)\]
The results of the Shapiro-Wilk test indicate that we have enough evidence to reject the null hypothesis (as the p-value is essentially 0, compared with the standard significance level of 0.05), leading us to conclude that the errors are not normally distributed.
quad.lm=lm(Salary~YrsExper + I(YrsExper^2),data=dataset)
plot(Salary~YrsExper,bg="Blue",pch=21,cex=1.2,
ylim=c(0,1.1*max(Salary)),xlim=c(0,1.1*max(YrsExper)),main="Scatter Plot and Quadratic of Salary vs YrsExper",data=dataset)
myplot = function(x){quad.lm$coef[1] + quad.lm$coef[2]*x + quad.lm$coef[3]*x^2}
curve(myplot, lwd = 2, add = TRUE)
Fitting a quadratic to the data produces a visibly different result; the fit does not look completely linear, unlike the earlier linear-model plot, but further analysis will clarify the results and allow us to make a final decision.
quad.fit = fitted(quad.lm)
plot(quad.lm, which = 1)
normcheck(quad.lm, shapiro.wilk = TRUE)
The p-value is again essentially 0. The results of the Shapiro-Wilk test indicate that we DO have enough evidence to reject the null hypothesis, leading us again to conclude that the errors are NOT normally distributed.
summary(quad.lm)
##
## Call:
## lm(formula = Salary ~ YrsExper + I(YrsExper^2), data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41717 -6013 -1285 5617 22271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36378.79 1565.92 23.232 < 2e-16 ***
## YrsExper -199.45 251.95 -0.792 0.429
## I(YrsExper^2) 38.49 7.68 5.012 1.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8413 on 205 degrees of freedom
## Multiple R-squared: 0.4468, Adjusted R-squared: 0.4414
## F-statistic: 82.77 on 2 and 205 DF, p-value: < 2.2e-16
Here we have the following values \[ \begin{align} \hat\beta_0 &= 36378.79\\ \hat\beta_1 &= -199.45\\ \hat\beta_2 &= 38.49 \end{align} \]
Furthermore, the summary gives the R-squared values, which we can use later to compare with the linear model and see which model fits the data better.
ciReg(quad.lm, conf.level=0.95, print.out=TRUE)
## 95 % C.I.lower 95 % C.I.upper
## (Intercept) 33291.41402 39466.17626
## YrsExper -696.19349 297.29158
## I(YrsExper^2) 23.35142 53.63647
So the fitted equation comes out to be \[ \begin{align} \hat\beta_0 + \hat\beta_1x_i+\hat\beta_2x^2_i&= 36378.79 -199.45x_i + 38.49x^2_i \end{align} \]
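As a quick check, plugging \(x_i = 10\) into the fitted quadratic (using the rounded coefficients, so the result differs slightly from the `predict()` output below) gives:

\[ 36378.79 - 199.45 \times 10 + 38.49 \times 10^2 = 36378.79 - 1994.5 + 3849 \approx 38{,}233 \]

Note that the negative linear coefficient makes the curve nearly flat at low experience before the squared term dominates at higher experience.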
amount = predict(dataset.lm, data.frame(YrsExper=c(10,25,40)))
amount
## 1 2 3
## 40246.11 55120.69 69995.26
amount2 = predict(quad.lm, data.frame(YrsExper=c(10,25,40)))
amount2
## 1 2 3
## 38233.68 55451.24 89991.07
The predictions made using the first model (linear) are greater at first and smaller later than the predictions made by the second model (quadratic). They are close to one another at 10 and 25 years, but diverge sharply at 40 years of experience. Further comparisons will be necessary to determine which model will be the best fit for the dataset.
summary(dataset.lm)
##
## Call:
## lm(formula = Salary ~ YrsExper, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33054 -5782 -967 5792 30971
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30329.73 1054.54 28.76 <2e-16 ***
## YrsExper 991.64 88.44 11.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8892 on 206 degrees of freedom
## Multiple R-squared: 0.379, Adjusted R-squared: 0.376
## F-statistic: 125.7 on 1 and 206 DF, p-value: < 2.2e-16
summary(quad.lm)
##
## Call:
## lm(formula = Salary ~ YrsExper + I(YrsExper^2), data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41717 -6013 -1285 5617 22271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36378.79 1565.92 23.232 < 2e-16 ***
## YrsExper -199.45 251.95 -0.792 0.429
## I(YrsExper^2) 38.49 7.68 5.012 1.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8413 on 205 degrees of freedom
## Multiple R-squared: 0.4468, Adjusted R-squared: 0.4414
## F-statistic: 82.77 on 2 and 205 DF, p-value: < 2.2e-16
The multiple \(R^2\) for the linear model is 0.379; the adjusted \(R^2\) is 0.376. The multiple \(R^2\) for the quadratic model is 0.4468; the adjusted \(R^2\) is 0.4414.
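The adjusted \(R^2\) penalizes each model for the number of fitted parameters. As a check, applying the standard formula to the linear model (with \(n = 208\) observations and \(p = 1\) predictor) recovers the reported value:

\[ R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1} = 1 - \frac{(1-0.379)(207)}{206} \approx 0.376 \]

Because this penalty grows with \(p\), the quadratic model's higher adjusted \(R^2\) is not merely an artifact of adding a term.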
According to these results, both models have fairly low multiple \(R^2\) values, but the quadratic model's is noticeably higher, which tells us that the quadratic model fits the data better than the linear model. “Linear regression calculates an equation that minimizes the distance between the fitted line and all of the data points. Technically, least squares regression minimizes the sum of the squared residuals.” The adjusted \(R^2\) is also greater for the quadratic model than for the linear model, which tells us that the quadratic model accounts for more of the data and can predict with more accuracy than the linear model.
To avoid any premature decisions, it is always suggested to make a final calculation before making a final decision, so we perform an ANOVA test to confirm whether our conclusion was right.
anova(dataset.lm,quad.lm)
Looking at the results, the p-value is very small, causing us to reject the null hypothesis: there is strong evidence against the null hypothesis that the linear model is adequate compared with the quadratic model. With this confirmation, we can say that the quadratic model is clearly better than the linear model.
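Because the two models differ by exactly one term, the ANOVA F statistic for this comparison equals the square of the t statistic reported for the quadratic coefficient (a standard identity for nested models that differ by one parameter):

\[ F = \frac{(SSE_{linear}-SSE_{quad})/1}{MSE_{quad}} = t^2 \approx (5.012)^2 \approx 25.1 \]

so the ANOVA result is consistent with the highly significant t test for \(I(YrsExper^2)\) in the quadratic summary.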
But before concluding, we are going to use Cook's distance plot to avoid any bias in the result.
cooks20x(quad.lm)
dataset2.lm=lm(Salary~YrsExper+ I(YrsExper^2), data=dataset[-208,])
summary(dataset2.lm)
##
## Call:
## lm(formula = Salary ~ YrsExper + I(YrsExper^2), data = dataset[-208,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -20615 -5905 -1325 5642 22410
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36933.556 1464.664 25.216 < 2e-16 ***
## YrsExper -361.632 236.892 -1.527 0.128
## I(YrsExper^2) 47.196 7.334 6.436 8.61e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7851 on 204 degrees of freedom
## Multiple R-squared: 0.5188, Adjusted R-squared: 0.514
## F-statistic: 109.9 on 2 and 204 DF, p-value: < 2.2e-16
summary(quad.lm)
##
## Call:
## lm(formula = Salary ~ YrsExper + I(YrsExper^2), data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41717 -6013 -1285 5617 22271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36378.79 1565.92 23.232 < 2e-16 ***
## YrsExper -199.45 251.95 -0.792 0.429
## I(YrsExper^2) 38.49 7.68 5.012 1.16e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8413 on 205 degrees of freedom
## Multiple R-squared: 0.4468, Adjusted R-squared: 0.4414
## F-statistic: 82.77 on 2 and 205 DF, p-value: < 2.2e-16
Looking at the summaries of the original quadratic model and the model fitted after removing the observation with the largest Cook's distance, we can see that the residual standard error decreased from 8413 to 7851. The adjusted \(R^2\) also increased from 0.4414 to 0.514, giving us a better chance at predicting new data and ensuring that more of the data is captured by the curve. The F-statistic likewise increased from 82.77 to 109.9, a better result overall.
To conclude: we obtained two different graphs and sets of predictions from the two models. After analyzing the predictions and summaries, and after using Cook's distances to delete the most influential observation from the quadratic model, we achieved a better result in the form of a higher adjusted \(R^2\) and a reduced residual standard error. The quadratic model also gave better predictions than the linear model, so we can say it is better suited to the data analyzed here. Cook's plot gave a clear view of the data point that was pulling the fit toward itself, and when we removed that outlier the multiple \(R^2\) increased, bringing more of the values closer to the predicted curve.
The research question here was: “Is years of experience alone enough to increase your salary?” According to all of the research and analysis done so far, we can clearly see that salary does increase with an employee's years of experience, but the increase is modest enough to suggest that salary depends on other factors as well. Still, the analysis shows that salary does increase once an employee has a reasonable amount of experience.
This salary-prediction exercise could be improved by taking many other factors into consideration, such as the employee's degree, proficiency, and technical knowledge. The dataset used here consists of only about 200 records, so to obtain a more accurate estimate we would need more data on this matter and would need to account for all the possible values in the different categories.
Salary differences - https://insights.dice.com/2016/02/11/how-much-will-experience-increase-my-salary/
Graph of Increase in Salary - https://www.forbes.com/sites/cameronkeng/2014/06/22/employees-that-stay-in-companies-longer-than-2-years-get-paid-50-less/#1fb8fb8be07f
Assumptions of Linear Regression - https://www.statisticssolutions.com/assumptions-of-linear-regression/
Forbes Articles - https://www.forbes.com/sites/cameronkeng/2014/06/22/employees-that-stay-in-companies-longer-than-2-years-get-paid-50-less/#3ae687b8e07f
Dataset website - https://www.kaggle.com/rohankayan/years-of-experience-and-salary-dataset
Simple Linear Regression - https://onlinecourses.science.psu.edu/stat501/node/251/
Tools to calculate salaries - https://www.recruiter.com/i/10-great-salary-calculators-to-save-you-time/
Textbook - Mendenhall, W.M. and Sincich, T.L. (2016). Statistics for Engineering and the Sciences, Sixth Edition. ISBN 9781498728850, LCCN 2015041987. https://books.google.com/books?id=OHNGrgEACAAJ
Canvas - https://canvas.ou.edu/courses/84540